{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 17 - Normal distributions.\n", "\n", "The normal distribution is also known as the bell curve or the Gaussian distribution. We will look at several of its properties in this lab.\n", "\n", "First, let's import the necessary libraries." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a function in `numpy` for sampling from the normal distribution, which has the form:\n", "`np.random.normal(loc=0.0, scale=1.0, size=None)`\n", "\n", "This function has three optional parameters:\n", "- `loc` (short for location) is the mean of the distribution and where it will be centered. The default value is 0.\n", "- `scale` is the standard deviation or spread of the distribution, and controls how wide the distribution will be. The default value is 1.\n", "- `size` is the number of samples to take. The default, `None`, returns a single value rather than an array.\n", "\n", "For example, `np.random.normal(loc = -5.4, scale = 2.8, size = 10)` will simulate 10 samples from a normal distribution with mean -5.4 and standard deviation 2.8. Try it below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can you take 5 samples from a normal distribution with a mean of 2.5 and standard deviation of 0.4?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's take 10,000 samples from the *standard normal distribution*, which has mean 0 and standard deviation 1, and save the array of samples as a variable." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer\n", "\n", "sample = np.random.normal(size = 10000)\n", "\n", "
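As a rough check on the `loc`, `scale`, and `size` parameters described earlier, the sketch below draws from two normal distributions and compares the sample mean and standard deviation with the values passed in. The seed and the names `single`, `standard`, and `shifted` are assumptions made for illustration; they are not part of the lab.

```python
import numpy as np

# Fix the seed so the draws are repeatable (illustrative choice).
np.random.seed(0)

# With no arguments, size=None returns a single number, not an array.
single = np.random.normal()

# 10,000 draws from the standard normal distribution (loc=0, scale=1).
standard = np.random.normal(size=10000)

# 10,000 draws centered at -5.4 with standard deviation 2.8.
shifted = np.random.normal(loc=-5.4, scale=2.8, size=10000)

# The sample statistics should land close to the requested values.
print(single)
print(standard.shape, round(standard.mean(), 2), round(standard.std(), 2))
print(shifted.shape, round(shifted.mean(), 2), round(shifted.std(), 2))
```

The sample mean and standard deviation will not match `loc` and `scale` exactly, but with 10,000 draws they should be close.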
\n", "\n", "Display your variable to make sure your code worked correctly:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To visualize this sample, let's plot it as a histogram. Remember to convert the numpy array into a pandas Series first." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer\n", "\n", "pd.Series(sample).hist(bins = 40)\n", "
\n", "\n", "Can you change the y axis to show the density instead of the frequency?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer\n", "\n", "pd.Series(sample).hist(bins = 40, density = True)\n", "
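To see what `density = True` does, the sketch below rebuilds the density values by hand with `np.histogram` and compares them with the standard normal density, `np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)`. It works with numbers instead of a plot, and the seed and the names `density`, `curve`, and `max_gap` are assumptions made for illustration.

```python
import numpy as np

np.random.seed(1)  # illustrative seed, not part of the lab
sample = np.random.normal(size=10000)

# Raw counts per bin, using 40 bins as in the histogram above.
counts, edges = np.histogram(sample, bins=40)
widths = np.diff(edges)

# Density = count / (total count * bin width), so the bar areas sum to 1.
density = counts / (counts.sum() * widths)
total_area = (density * widths).sum()

# Standard normal density evaluated at the bin centers.
centers = (edges[:-1] + edges[1:]) / 2
curve = np.exp(-centers**2 / 2) / np.sqrt(2 * np.pi)

# Largest gap between the histogram heights and the curve.
max_gap = np.abs(density - curve).max()
print(round(total_area, 6), round(max_gap, 3))
```

With the frequency histogram the bar heights add up to the sample size; with the density histogram it is the bar areas that add up to 1, which is what makes the bars comparable to the bell curve.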
\n", "\n", "How does this compare to the standard normal distribution (second graph on [this page](https://www.inferentialthinking.com/chapters/14/3/SD_and_the_Normal_Curve.html))?\n", "\n", "What's the probability that a number sampled from the standard normal distribution is in between -1 and 1? We can estimate this probability from our sample by counting the number of values between -1 and 1, and dividing by the size of our sample.\n", "\n", "#### Step 1: Count the number of samples between -1 and 1\n", "\n", "You can use filters with a numpy array the same way as with a pandas Series (or you can convert the numpy array into a pandas Series first). `.shape` and `.sum()` will also work with numpy arrays. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer\n", "\n", "geq_1_filter = sample >= -1\n", "leq_1_filter = sample <= 1\n", "(geq_1_filter & leq_1_filter).sum()\n", "
\n", "or\n", "
\n", "\n", "geq_1_filter = sample >= -1\n", "leq_1_filter = sample <= 1\n", "sample[geq_1_filter & leq_1_filter].shape\n", "
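The filter logic above can be checked on a small hand-made array, where the count is easy to verify by eye. The array `values` and the filter names are made up for illustration. Note that `.shape` returns a tuple, so the sketch uses `.shape[0]` to get the count as a plain number.

```python
import numpy as np

# Made-up values for illustration; four of them lie between -1 and 1.
values = np.array([-2.3, -1.0, -0.4, 0.0, 0.9, 1.7, 2.5])

at_least_neg1 = values >= -1   # True where the value is -1 or larger
at_most_1 = values <= 1        # True where the value is 1 or smaller
in_range = at_least_neg1 & at_most_1

# Each True counts as 1, so summing the filter counts the matches.
count_by_sum = in_range.sum()

# Filtering keeps only the matching values; shape[0] is how many remain.
count_by_shape = values[in_range].shape[0]

print(count_by_sum, count_by_shape)  # 4 4
```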
\n", "\n", "#### Step 2: Calculating the probability\n", "Now compute the probability that a sample is between -1 and 1." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Does this calculated probability correspond with the theory?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Comparing two normal distributions\n", "\n", "We'll now compare two normal distributions with different means and variances. One of the distributions will be the standard normal distribution that we previously sampled.\n", "\n", "Take a sample of size 10,000 from the normal distribution with mean = 1 and standard deviation = 2.5." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the histogram of each sample on the same graph below. Remember to make the histograms transparent using the `alpha` parameter." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now normalize each graph by adding the parameter `density = True`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How did the histograms change? \n", "\n", "Let's compute the probability that samples from the second normal distribution are within one standard deviation of the mean. 
Since the mean was 1 and the standard deviation was 2.5, we want the probability that samples are between 1 - 2.5 = -1.5 and 1 + 2.5 = 3.5." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How does this compare with the probability that a value from the first sample (standard normal distribution) is between -1 and 1?\n", "\n", "### Z-Score\n", "\n", "We can compute the *z-score* of each value in a sample (from any normal distribution) by subtracting the mean of the sample and dividing by the standard deviation.\n", "\n", "Compute the mean and standard deviation of your second sample and save them as the variables `mean2` and `sd2`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Pattern:\n", "\n", "variable_name = sample_variable_name.mean()\n", "
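A small sketch of the z-score calculation on a made-up array, assuming the numpy default for `.std()` (the population standard deviation, `ddof=0`). The names `values` and `z_scores` are hypothetical. After standardizing, the values should have mean 0 and standard deviation 1, whatever the original center and spread were.

```python
import numpy as np

# Made-up data with mean 5 and (population) standard deviation 2.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()
sd = values.std()  # numpy default is ddof=0

# Subtract the mean, then divide by the standard deviation.
z_scores = (values - mean) / sd

print(mean, sd)
print(z_scores)
print(z_scores.mean(), z_scores.std())
```

A z-score of 2 here means the value sits two standard deviations above the mean, which is why z-scores from different normal distributions can be compared on the same scale.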
\n", "\n", "We can compute the z-scores of all of sample two with the code `(sample2 - mean2)/sd2`. Try it below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save the z-scores in a new variable, and plot them as a histogram." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What happened?\n", "\n", "### Challenges:\n", "- What's the probability that a sample from the standard normal distribution is between -2 and 2?\n", "- Plot overlapping histograms to compare samples from the standard normal distribution with the normal distribution with mean = -1 and standard deviation 0.5.\n", "- What's the probability that samples from the normal distribution with mean = -1 and standard deviation 0.5 are within 2 standard deviations of the mean?" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }